Bayesian Reaction Optimization Using EDBO - Part III

17 minute read

Published: October 04, 2020

Part III - Bayesian Reaction Optimization

In part I we installed the pre-release of EDBO and ran basic functionality tests. In part II we got a handle on some simple use cases for the software. In this post, we will see how to apply EDBO to chemical reaction optimization problems by automatically encoding a search space for a new problem and running a round of human-in-the-loop optimization. To start, we need to define exactly what problem we are trying to solve. Specifically we need to identify: (1) the objective, (2) where to search, and (3) our experimental capabilities.

The objective

The objective is what you are trying to optimize, and as such it can have a big impact on your results. For example, if you are optimizing an enantioselective transformation you may be considering either enantiomeric excess or enantiomeric ratio as your target. In this context, enantiomeric ratio (or inferred $\Delta\Delta G^{TS}$) is preferable due to the established physical relationship between the ratio of enantiomers and the relative rates of the selectivity-determining step. However, with this objective it is conceivable that the reaction could be optimized to deliver >90% ee but only 5% yield (and I have seen exactly this happen in some projects). Nevertheless, the optimizer would have very effectively carried out the task it was presented with. This in turn leads to the challenging (and ever-present) situation in which perturbing the conditions to optimize yield can lead to a degradation in ee. Thus, in the context of single-objective optimization it may be preferable to redefine your objective in a way that captures both goals. For example, you could instead optimize the yield of the major enantiomer (inferred from the selectivity and the overall yield).
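To make that last suggestion concrete, here is a minimal sketch of how the two measurements could be combined into one number. The function names are hypothetical (they are not part of EDBO), ee and yield are assumed to be given in percent, and $\Delta\Delta G^{TS}$ is taken as $RT\ln(er)$.

```python
import math

R_KCAL = 1.987204e-3  # gas constant in kcal/(mol*K)


def enantiomeric_ratio(ee_percent):
    """Convert % ee into er = major/minor (undefined for 100% ee)."""
    major = (100.0 + ee_percent) / 200.0   # mole fraction of the major enantiomer
    return major / (1.0 - major)


def ddg_ts(ee_percent, temperature_k=298.15):
    """Inferred ddG(TS) in kcal/mol, from er = exp(ddG / RT)."""
    return R_KCAL * temperature_k * math.log(enantiomeric_ratio(ee_percent))


def major_enantiomer_yield(yield_percent, ee_percent):
    """Yield of the major enantiomer, in % of theoretical."""
    return yield_percent * (100.0 + ee_percent) / 200.0


# The scenario from the text: high selectivity, poor yield
print(enantiomeric_ratio(90))          # roughly 19 (i.e., 95:5)
print(ddg_ts(90))                      # roughly 1.7 kcal/mol at 25 C
print(major_enantiomer_yield(5, 90))   # 4.75
```

With this objective a 90% ee / 5% yield result scores 4.75, while a less selective 80% ee / 60% yield result scores 54, so the optimizer is no longer rewarded for selectivity alone.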

Where to search?

After we have identified our objective, we then need to define exactly what subset of plausible experiments (all possible catalysts, reagents, concentrations, temperatures, etc.) we should consider. This is a critical step because the choice of search space will ultimately determine the success of the campaign: if the space is too small we lower the chances of finding optimal conditions; if the search space is too large it may take longer to discover optimal conditions and modeling will require more computational resources. There are a number of approaches to this problem, including simply drawing conditions from the literature, physical knowledge, and our experience, or utilizing a less biased data-driven approach such as unsupervised learning to select a subset of experiments which spans the larger search space. In general, I believe that we should utilize both approaches as a way to include conditions which fit our knowledge (a kind of chemist’s prior) and leave open the possibility for unexpected results. In order to stay on topic I will not discuss this further in this post. However, keep in mind that whatever our approach it is worth putting in the effort to (1) search the chemical literature for related reactions (e.g., SciFinder, Reaxys, etc.), (2) employ our knowledge and training as synthetic chemists, and (3) consider a broader hypothesis space that is less biased by our experience. For example, if you were optimizing a Pd-catalyzed cross-coupling reaction you would want to include tried-and-tested ligands such as Buchwald-type phosphines as well as sample other diverse structures.

Experimental capabilities

Prior to starting an optimization campaign we should consider our experimental capabilities. For example, we need to decide the number of experiments to run in parallel (if any). It is also worthwhile to take the time to collect initial data using “obvious” experimental conditions as a baseline (e.g., if you are running a Ni/photoredox reaction, try dtbbpy as a ligand first). Of course, if you are interested in optimizing a reaction with EDBO you have likely already carried out initial experiments to test your hypothesis. Once we have initial data, it can be helpful to establish the error associated with the experiments by running duplicates and/or making a calibration curve. This information will give us an idea of what a meaningful improvement in the objective is and potentially enable us to calibrate the optimizer’s noise parameters.
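One simple way to turn duplicate runs into a noise estimate is a pooled standard deviation. The sketch below is only an illustration: the yields are made-up numbers, and `pooled_std` is a hypothetical helper, not an EDBO function.

```python
import math
import statistics

# Hypothetical duplicate measurements: % yield for three baseline conditions
duplicates = {
    "condition_A": [42.0, 45.0],
    "condition_B": [10.0, 13.0, 11.0],
    "condition_C": [71.0, 68.0],
}


def pooled_std(groups):
    """Pooled sample standard deviation across groups of replicates."""
    ss = 0.0   # within-group sum of squared deviations
    dof = 0    # total degrees of freedom
    for values in groups:
        n = len(values)
        ss += statistics.variance(values) * (n - 1)
        dof += n - 1
    return math.sqrt(ss / dof)


noise = pooled_std(duplicates.values())
print(f"estimated experimental noise: +/- {noise:.1f}% yield")
```

A difference between two conditions that is smaller than about twice this value is hard to distinguish from run-to-run scatter.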

Reaction optimization

OK, now that we have gotten the philosophical exposition out of the way we can get down to business and actually demonstrate the use of EDBO for reaction optimization. Let’s assume that we have already identified the objective, search space, and our experimental capabilities. We can then instantiate an optimizer that fits these decisions. The objective is defined by the signal you feed into the optimizer; here we use %yield of desired product. The experimental capabilities can be defined using keyword arguments (e.g., batch_size=5 tells the optimizer you can run 5 experiments at a time). However, communicating the selected search space to the optimizer requires us to decide on a reaction representation.

Encoding the reaction space

We will represent our reaction(s) using a numerical encoding. This encoding could include ordered numerical dimensions (temperature, concentration, etc.), unfeaturized categorical descriptors (one-hot-encoding), and features which contain physical or chemical information about the experiments in the reaction space. EDBO has complete flexibility with respect to how you define the encoded search space.

Build it yourself: Of course the most flexible (and laborious) option is to build the search space yourself as a pandas DataFrame.

import pandas as pd
from edbo.bro import BO
from edbo.utils import Data

# Define dimensions
concentrations = [0.1, 0.2, 0.3, 0.4, 0.5]
temperatures = [0, 10, 20, 30, 40, 50, 60, 70, 80, 90, 100]

# Simple loop to build combinations
search_space = []
for c in concentrations:
    for t in temperatures:
        search_space.append([c,t])

domain = Data(pd.DataFrame(search_space, columns=['C', 'T']))

# The built in GP model requires normalization
domain.standardize(scaler='minmax', target=None)

# Instantiate BO object
bo = BO(domain=domain.data)

The model’s normalized domain can be accessed via the objective module:

bo.obj.domain.head()

	C	T
0	0	0
1	0	0.1
2	0	0.2
3	0	0.3
4	0	0.4

And the unnormalized domain can be accessed from the data container:

domain.base_data.head()

	C	T
0	0.1	0
1	0.1	10
2	0.1	20
3	0.1	30
4	0.1	40
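The relationship between the two tables can be reproduced by hand. The sketch below assumes that the 'minmax' scaler performs ordinary min-max scaling onto [0, 1] (which is consistent with the output shown above) and uses itertools.product as a compact alternative to the nested loop; `minmax` is a hypothetical helper, not an EDBO function.

```python
from itertools import product

concentrations = [0.1, 0.2, 0.3, 0.4, 0.5]
temperatures = [0, 10, 20, 30, 40, 50, 60, 70, 80, 90, 100]


def minmax(values):
    """Rescale a list of numbers onto [0, 1]."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]


# Every (C, T) combination, in the same order as the nested loop
raw = list(product(concentrations, temperatures))
scaled = list(product(minmax(concentrations), minmax(temperatures)))

# First five rows: raw values next to their normalized counterparts
for r, s in list(zip(raw, scaled))[:5]:
    print(r, "->", s)
```

Scaling each dimension before taking the product gives the same result as scaling the columns of the finished table, because every value of a dimension appears in the grid.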

The flexibility can be great if you are building a custom search space (e.g., with computed interaction terms). However, once we start to include chemical descriptors for categorical components this process can take a bit of work. In many cases, we can actually have EDBO build the search space automatically. Functions for building the search space can be found in the edbo.feature_utils module (e.g., reaction_space). We can check out the documentation using help(reaction_space).

Help on function reaction_space in module edbo.feature_utils:

reaction_space(component_dict, encoding={}, descriptor_matrices={}, clean=True, decorrelate=True, decorrelation_threshold=0.95, standardize=True)
    Build a reaction space object form component lists.
    
    Parameters
    ----------
    reaction_components : dict
        Dictionary of reaction components of the form: 
                
        Example
        -------
        Defining reaction components ::
        
            {'A': [a1, a2, a3, ...],
             'B': [b1, b2, b3, ...],
             'C': [c1, c2, c3, ...],
                        .
             'N': [n1, n2, n3, ...]}
            
        Components can be specified as: (1) arbitrary names, (2) chemical 
        names or nicknames, (3) SMILES strings, or (4) numeric values.
    encodings : dict
        Dictionary of encodings with keys corresponding to reaction_components.
        Encoding dictionary has the form: 
                
        Example
        -------
        Defining reaction encodings ::
                
            {'A': 'resolve',
             'B': 'ohe',
             'C': 'smiles',
                  .
             'N': 'numeric'}
            
        Encodings can be specified as: ('resolve') resolve a compound name 
        using the NIH database and compute Mordred descriptors, ('ohe') 
        one-hot-encode, ('smiles') compute Mordred descriptors using a smiles 
        string, ('numeric') numerical reaction parameters are used as passed.
        If no encoding is specified, the space will be automatically 
        one-hot-encoded.
    descriptor_matrices : dict
        Dictionary of descriptor matrices where keys correspond to 
        reaction_components and values are pandas.DataFrames.
            
        Descriptor dictionary has the form: 
                
        Example
        -------
        User defined descriptor matrices ::
                
            # DataFrame where the first column is the identifier (e.g., a SMILES string)
                
            A = pd.DataFrame([....], columns=[...])
                
            --------------------------------------------
            A_SMILES  |  des1  |  des2  | des3 | ...
            --------------------------------------------
                .         .        .       .     ...
                .         .        .       .     ...
            --------------------------------------------
                
            # Dictionary of descriptor matrices defined as DataFrames
                
            descriptor_matrices = {'A': A}
            
        Note
        ----
        If a key is present in both encoding and descriptor_matrices then 
        the descriptor matrix will take precedence.
    
    clean : bool
        If True, remove non-numeric and singular columns from the space.
    decorrelate : bool
        If True, iteratively remove features which are correlated with selected
        descriptors.
    decorrelation_threshold : float
        Remove features which have a correlation coefficient greater than
        specified value.
    standardize : bool
        If True, standardize descriptors on the unit hypercube.
    
    Returns
    ----------
    edbo.utils.Data
        Reaction space data container.
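To give a feel for what the decorrelate option does, here is a small greedy correlation filter in plain Python. It is a sketch of the general idea only and is not EDBO's implementation: the column order, the use of the Pearson coefficient, and the made-up descriptor values are all assumptions, and constant columns are assumed to have been removed already (as the clean option describes).

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def decorrelate(columns, threshold=0.95):
    """Keep a column only if |r| <= threshold against every column kept so far."""
    kept = {}
    for name, values in columns.items():
        if all(abs(pearson(values, k)) <= threshold for k in kept.values()):
            kept[name] = values
    return kept


# Hypothetical descriptor columns for four ligands
descriptors = {
    "des1": [1.0, 2.0, 3.0, 4.0],
    "des2": [2.1, 3.9, 6.2, 8.0],   # nearly a multiple of des1
    "des3": [4.0, 1.0, 3.0, 2.0],
}
print(list(decorrelate(descriptors)))   # ['des1', 'des3']
```

Because des2 carries almost the same information as des1, dropping it shrinks the encoding without losing much signal.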

Automatic encoding with EDBO: You may have noticed in part II that we didn’t have to build a search space DataFrame in our examples. This is because the BO_express class automates this step. Let’s check out an example. Suppose you are interested in optimizing a Mitsunobu reaction and you identify azadicarboxylate, phosphine, equivalents, solvent, substrate concentration, and temperature as important parameters. An example reaction scheme is below.

[Figure: example Mitsunobu reaction scheme]

All we need to define in order to automatically encode our reaction is a dictionary of components and a dictionary of the desired encodings. Using BO_express in this way will encode, clean, decorrelate, and normalize the reaction space.

from edbo.bro import BO_express

# (0) Define possible settings of each reaction parameter

azadicarboxylate_SMILES = ['CC(C)(OC(/N=N/C(OC(C)(C)C)=O)=O)C',
                           'CC(OC(/N=N/C(OC(C)C)=O)=O)C',
                           'ClC1=CC=C(COC(/N=N/C(OCC2=CC=C(Cl)C=C2)=O)=O)C=C1',
                           'O=C(OCC1=CC=CC=C1)/N=N/C(OCC2=CC=CC=C2)=O',
                           'O=C(N1CCCCC1)/N=N/C(N2CCCCC2)=O',
                           'CN(C(/N=N/C(N(C)C)=O)=O)C']

phosphine_SMILES = ['C1(P(C2=CC=CC=C2)C3=CC=CC=C3)=CC=CC=C1',
                    'CCCCP(CCCC)CCCC',
                    'CP(C1=CC=CC=C1)C2=CC=CC=C2',
                    'COC1=CC=C(P(C2=CC=C(OC)C=C2)C3=CC=C(OC)C=C3)C=C1',
                    'FC1=CC=C(P(C2=CC=C(F)C=C2)C3=CC=C(F)C=C3)C=C1',
                    'FC(F)(F)C1=CC=C(P(C2=CC=C(C(F)(F)F)C=C2)C3=CC=C(C(F)(F)F)C=C3)C=C1',
                    'CC1=CC=C(P(C2=CC=C(C=C2)C)C3=CC=C(C=C3)C)C=C1',
                    'C1(P(C2CCCCC2)C3CCCCC3)CCCCC1',
                    'C1(P(C2=CC=CO2)C3=CC=CO3)=CC=CO1',
                    'CC1=CC=C(P(C2=CC=CC=C2)C3=CC=CC=C3)C=C1',
                    'ClC1=CC=C(P(C2=CC=C(Cl)C=C2)C3=CC=C(Cl)C=C3)C=C1',
                    'CN(C1=CC=C(P(C2=CC=CC=C2)C3=CC=CC=C3)C=C1)C']

solvent_NAMES = ['dimethyl formamide', 'tetrahydrofuran', 'acetonitrile', 'toluene', 'ethyl acetate']

substrate_concentrations = [0.05, 0.10, 0.15, 0.20]
azadicarb_equivs = [1.1, 1.3, 1.5, 1.7, 1.9]
phos_equivs = [1.1, 1.3, 1.5, 1.7, 1.9]
temperatures = [5, 15, 25, 35, 45]

# (1) Define a dictionary of components

components = {'azadicarboxylate':azadicarboxylate_SMILES,
              'phosphine':phosphine_SMILES,
              'solvent':solvent_NAMES,
              'substrate_concentration':substrate_concentrations,
              'azadicarb_equiv':azadicarb_equivs,
              'phos_equiv':phos_equivs,
              'temperature':temperatures}

# (2) Define a dictionary of desired encodings
#     Note: if an encoding is not specified OHE will be used

encoding = {'substrate_concentration':'numeric',
            'azadicarb_equiv':'numeric',
            'phos_equiv':'numeric',
            'temperature':'numeric'}

# (3) Instantiate BO_express to automatically build the reaction space

bo = BO_express(components, 
                encoding,
                batch_size=5,
                target='yield')

We can use EDBO utilities to visualize the azadicarboxylates and phosphines.

from edbo.chem_utils import ChemDraw

for SMILES in [azadicarboxylate_SMILES, phosphine_SMILES]:
    cdx = ChemDraw(SMILES, 6)
    cdx.show()

One-hot-encoding categorical features: This gives our search space 180,000 possible experiments and a 27 dimensional encoding.

bo.obj.domain.head()

	azadicarboxylate=CC(C)(OC(/N=N/C(OC(C)(C)C)=O)=O)C	…
0	1	…
1	1	…
2	1	…
3	1	…
4	1	…

bo.reaction.base_data[bo.reaction.index_headers].head()

	azadicarboxylate_index	phosphine_index	solvent_index	substrate_concentration_index	…
0	CC(C)(OC(/N=N/C(OC(C)(C)C)=O)=O)C	C1(P(C2=CC=CC=C2)C3=CC=CC=C3)=CC=CC=C1	dimethyl formamide	0.05	…
1	CC(C)(OC(/N=N/C(OC(C)(C)C)=O)=O)C	C1(P(C2=CC=CC=C2)C3=CC=CC=C3)=CC=CC=C1	dimethyl formamide	0.05	…
2	CC(C)(OC(/N=N/C(OC(C)(C)C)=O)=O)C	C1(P(C2=CC=CC=C2)C3=CC=CC=C3)=CC=CC=C1	dimethyl formamide	0.05	…
3	CC(C)(OC(/N=N/C(OC(C)(C)C)=O)=O)C	C1(P(C2=CC=CC=C2)C3=CC=CC=C3)=CC=CC=C1	dimethyl formamide	0.05	…
4	CC(C)(OC(/N=N/C(OC(C)(C)C)=O)=O)C	C1(P(C2=CC=CC=C2)C3=CC=CC=C3)=CC=CC=C1	dimethyl formamide	0.05	…
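The size of the one-hot-encoded space quoted above can be checked with a few lines of arithmetic. The counts below are simply the lengths of the lists passed to BO_express; nothing here calls EDBO.

```python
import math

# Number of options per component, taken from the lists defined above
categorical = {"azadicarboxylate": 6, "phosphine": 12, "solvent": 5}
numeric = {"substrate_concentration": 4, "azadicarb_equiv": 5,
           "phos_equiv": 5, "temperature": 5}

# Every combination of settings is one candidate experiment
n_experiments = math.prod(list(categorical.values()) + list(numeric.values()))

# One column per category level, plus one column per numeric parameter
n_dimensions = sum(categorical.values()) + len(numeric)

print(n_experiments)   # 180000
print(n_dimensions)    # 27
```

Note how quickly the space grows: adding a single sixth solvent would add another 36,000 candidate experiments.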

There are also built-in methods to automatically generate cheminformatics reaction encodings and read in precomputed descriptor matrices for each component (we won’t cover this topic here but this is how you can use hand-curated or DFT-based descriptors). For example, let’s say you wanted to use Mordred descriptors to encode all of the reaction components. This is easy to do by using either SMILES strings or chemical names for the reaction components.

encoding = {'azadicarboxylate':'mordred',          # Compute mordred descriptors
            'phosphine':'mordred',                 
            'solvent':'resolve',                   # Search the NIH database for the structure
            'substrate_concentration':'numeric',
            'azadicarb_equiv':'numeric',
            'phos_equiv':'numeric',
            'temperature':'numeric'}

bo = BO_express(components, 
                encoding,
                batch_size=5,
                target='yield')

SMILES strings and chemical names: This gives our search space 180,000 possible experiments and a 644 dimensional encoding.

bo.obj.domain.head()

	azadicarboxylate_ABC	azadicarboxylate_SpMax_A	azadicarboxylate_VR2_A	…
0	0.350891	0.64384	0.194388	…
1	0.350891	0.64384	0.194388	…
2	0.350891	0.64384	0.194388	…
3	0.350891	0.64384	0.194388	…
4	0.350891	0.64384	0.194388	…

bo.reaction.base_data[bo.reaction.index_headers].head()

	azadicarboxylate_SMILES_index	phosphine_SMILES_index	solvent_SMILES_index	substrate_concentration_index	…
0	CC(C)(OC(/N=N/C(OC(C)(C)C)=O)=O)C	C1(P(C2=CC=CC=C2)C3=CC=CC=C3)=CC=CC=C1	CN(C)C=O	0.05	…
1	CC(C)(OC(/N=N/C(OC(C)(C)C)=O)=O)C	C1(P(C2=CC=CC=C2)C3=CC=CC=C3)=CC=CC=C1	CN(C)C=O	0.05	…
2	CC(C)(OC(/N=N/C(OC(C)(C)C)=O)=O)C	C1(P(C2=CC=CC=C2)C3=CC=CC=C3)=CC=CC=C1	CN(C)C=O	0.05	…
3	CC(C)(OC(/N=N/C(OC(C)(C)C)=O)=O)C	C1(P(C2=CC=CC=C2)C3=CC=CC=C3)=CC=CC=C1	CN(C)C=O	0.05	…
4	CC(C)(OC(/N=N/C(OC(C)(C)C)=O)=O)C	C1(P(C2=CC=CC=C2)C3=CC=CC=C3)=CC=CC=C1	CN(C)C=O	0.05	…

Dealing with errors: An obvious question here is what happens when EDBO cannot generate an encoded search space due to an error? Importantly, the BO_express class also attempts to handle exceptions due to erroneous SMILES strings or chemical names that cannot be resolved by spawning a bot to help.

Name cannot be resolved:

from edbo.bro import BO_express
            
components={'aryl_halide':['chlorobenzene','iodobenzene','bromobenzene', 'NOT-A-REAL-MOLECULE'],
            'base':['DBU', 'MTBD', 'potassium carbonate', 'potassium phosphate'],
            'solvent':['THF', 'Toluene', 'DMSO', 'DMAc'],
            'ligand': ['c1ccc(cc1)P(c2ccccc2)c3ccccc3',
                       'C1CCC(CC1)P(C2CCCCC2)C3CCCCC3',
                       'CC(C)c1cc(C(C)C)c(c(c1)C(C)C)c2ccccc2P(C3CCCCC3)C4CCCCC4']}
    
encoding={'aryl_halide':'resolve',
          'base':'ohe',
          'solvent':'resolve',
          'ligand':'smiles'}

bo = BO_express(components, encoding)

Spawns a dialog with edbo bot:

edbo bot: For help try BO_express.help() or see the documentation page.

edbo bot: Building reaction space...

edbo bot: The following names could not be converted to SMILES strings:
(3)   NOT-A-REAL-MOLECULE

edbo bot: Would you like to enter a SMILES string or one-hot-encode this component?
~  Use ohe instead

edbo bot: OK one-hot-encoding aryl_halide...

SMILES error:

from edbo.bro import BO_express
            
components={'aryl_halide':['chlorobenzene','iodobenzene','bromobenzene'],
            'base':['DBU', 'MTBD', 'potassium carbonate', 'potassium phosphate'],
            'solvent':['THF', 'Toluene', 'DMSO', 'DMAc'],
            'ligand': ['c1ccc(cc1)P(c2ccccc2)c3ccccc3',
                       'C1CCC(CC1)P(C2CCCCC2)C3CCCCC3',
                       'NOT-A-REAL-SMILES-STRING']}
    
encoding={'aryl_halide':'resolve',
          'base':'ohe',
          'solvent':'resolve',
          'ligand':'smiles'}

bo = BO_express(components, encoding)

Spawns a dialog with edbo bot:

edbo bot: For help try BO_express.help() or see the documentation page.

edbo bot: Building reaction space...

edbo bot: Mordred failed to encode one or more SMILES strings in ligand. Would you like to one-hot-encode instead?
~  no

edbo bot: Identifying problematic SMILES string(s)...

edbo bot: Mordred failed with the following string(s):
(2)   NOT-A-REAL-SMILES-STRING

edbo bot: ligand was removed from the reaction space. Resolve issues with SMILES string(s) and try again.

Running the optimizer

OK, back to the Mitsunobu reaction. Now that we have defined the search space we can start the optimization. We will follow exactly the same procedure that we used in part II. First, let’s gather some initial experimental data (in this case I wouldn’t advise using clustering to select points: since the search space is a 180,000 x 644 matrix and k-means is instance based, this will require quite a bit of memory).

bo.init_sample(seed=0)               # Initialize with a random sample
bo.export_proposed('round0.csv')     # Write the proposed experiments to a CSV
bo.get_experiments()                 # Get a DataFrame of the proposed experiments

	azadicarboxylate_SMILES_index	phosphine_SMILES_index	…
91036	O=C(OCC1=CC=CC=C1)/N=N/C(OCC2=CC=CC=C2)=O	C1(P(C2=CC=CC=C2)C3=CC=CC=C3)=CC=CC=C1	…
59238	CC(OC(/N=N/C(OC(C)C)=O)=O)C	CN(C1=CC=C(P(C2=CC=CC=C2)C3=CC=CC=C3)C=C1)C	…
149078	O=C(N1CCCCC1)/N=N/C(N2CCCCC2)=O	CN(C1=CC=C(P(C2=CC=CC=C2)C3=CC=CC=C3)C=C1)C	…
31550	CC(OC(/N=N/C(OC(C)C)=O)=O)C	C1(P(C2=CC=CC=C2)C3=CC=CC=C3)=CC=CC=C1	…
155877	CN(C(/N=N/C(N(C)C)=O)=O)C	CP(C1=CC=CC=C1)C2=CC=CC=C2	…
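As a rough check on the memory concern raised above, the encoded domain alone is already close to a gigabyte. This back-of-the-envelope estimate assumes the values are stored as 64-bit floats; the actual footprint inside EDBO may differ.

```python
n_experiments = 180_000
n_features = 644
bytes_per_value = 8   # assumption: float64 storage

# Size of the dense encoded domain, before any copies made during clustering
size_gb = n_experiments * n_features * bytes_per_value / 1e9
print(f"{size_gb:.2f} GB")   # 0.93 GB
```

Any algorithm that makes additional working copies of this matrix multiplies that figure, which is why a random initial sample is the safer choice here.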

Next, we would go into the lab and run the proposed experiments. You can also save your workspace and load it later when you are ready for the next round.

bo.save()

Since this is an example we will “run” the experiments by just assigning random values to the experiment yields. Once we have experimental data we can load the results and use EDBO to select the next round of experiments.

bo.load()                            # Load BO (if you exited)
bo.add_results('round0.csv')         # Add results via CSV
bo.run()                             # Fit model and optimize acquisition function
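For completeness, the "run the experiments" step described above could be simulated along the following lines. This is a sketch under stated assumptions: `fill_random_yields` is a hypothetical helper, the column layout of the demo file is invented, and the real file written by export_proposed may look different, so inspect it before adapting this.

```python
import csv
import os
import random
import tempfile


def fill_random_yields(path, target="yield", seed=0):
    """Overwrite (or add) the target column of a CSV with random yields in [0, 100]."""
    rng = random.Random(seed)
    with open(path, newline="") as f:
        reader = csv.DictReader(f)
        fieldnames = list(reader.fieldnames)
        rows = list(reader)
    if target not in fieldnames:
        fieldnames.append(target)
    for row in rows:
        row[target] = round(rng.uniform(0, 100), 1)
    with open(path, "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=fieldnames)
        writer.writeheader()
        writer.writerows(rows)
    return rows


# Demo on a small stand-in file (the real 'round0.csv' comes from bo.export_proposed)
demo_path = os.path.join(tempfile.mkdtemp(), "round0_demo.csv")
with open(demo_path, "w", newline="") as f:
    f.write("index,temperature_index\n91036,25\n59238,35\n")

filled = fill_random_yields(demo_path)
print(filled)
```

In a real campaign you would of course type the measured yields into the CSV instead of generating them.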

The random results corresponding to the initial experiments are the following:

bo.obj.results_input()

	…	substrate_concentration	azadicarb_equiv	phos_equiv	temperature	yield
external0	…	0	0.25	0.5	0.25	5
external1	…	0.333333	1	0.5	0.75	10
external2	…	0	0.75	0	0.75	40
external3	…	0	0.5	0	0	0
external4	…	1	0	0	0.5	3

In order to select the next experiments the optimizer has to fit the model and optimize the acquisition function. This is a pretty computationally expensive venture and the computation time scales with the size of the search space. For this example, it took 115 seconds on my laptop to select the next batch of 5 experiments. Next, we may want to run some preliminary analysis.

Check out the model’s fit: Notice that the model does not fit the data perfectly. This is because there is not enough data to overcome the surrogate model priors when computing the posterior predictive distribution.

bo.model.regression()

[Figure: output of bo.model.regression(), showing the model’s fit to the data]

Model mean and variance for selected experiments: We can reason about how exploitative (high posterior mean) or exploratory (high posterior variance) the model is for the selected points. In this example notice that the first point selected by the optimizer does not have the highest predicted yield. Rather, its posterior variance is large: these figures give a 95% prediction interval ($2\sigma$) including a maximum yield of ~58%.

bo.acquisition_summary()

	…	substrate_concentration	azadicarb_equiv	phos_equiv	temperature	predicted yield	variance
134104	…	0	1	0	1	18.8457	377.263
129604	…	0	1	0	1	21.0823	307.17
141624	…	0	1	1	1	16.7992	337.949
136999	…	1	1	1	1	21.7643	242.817
147504	…	0	0	0	1	21.9171	242.673
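The ~58% figure can be reproduced directly from the table: the interval is the predicted yield plus or minus two standard deviations, where the standard deviation is the square root of the reported variance. The numbers below are copied from the summary above.

```python
import math

# (predicted yield, variance) copied from the acquisition summary above
summary = {
    134104: (18.8457, 377.263),
    129604: (21.0823, 307.17),
    141624: (16.7992, 337.949),
    136999: (21.7643, 242.817),
    147504: (21.9171, 242.673),
}

intervals = {}
for index, (mean, variance) in summary.items():
    half_width = 2 * math.sqrt(variance)      # 2 sigma
    intervals[index] = (mean - half_width, mean + half_width)
    print(index, f"{mean - half_width:6.1f} to {mean + half_width:5.1f}")

# Experiment whose interval reaches the highest yield
best = max(intervals, key=lambda i: intervals[i][1])
print(best)   # 134104
```

The lower bounds come out negative because the Gaussian posterior is not constrained to the 0 to 100% range; only the upper end is meaningful for a yield.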

Plot the acquisition function: If we dig into the acquisition module we can also plot the acquisition function projection which results in the selection of each experiment.

import matplotlib.pyplot as plt
import numpy as np

fig, ax = plt.subplots(5,1, figsize=(10, 10), )

i = 0
for p in bo.acq.function.projections:
    
    # Acquisition Function on X indices
    ax[i].plot(range(len(p)), p, color='C' + str(i))
    ax[i].set_ylabel('EI' + str(i))
    
    # ArgMax
    top = np.argmax(p).flatten()[0]
    ax[i].scatter([top], [p[top]], s=100, color='black')
    
    i += 1

plt.xlabel('X')
plt.show()

[Figure: acquisition function projections for the five selected experiments]

Finally, we can export the proposed experiments to collect the next round of data, rinse and repeat.

bo.export_proposed('round1.csv')     # Write the proposed experiments to a CSV

Up next

In the final post on Bayesian reaction optimization using EDBO, we will investigate how you can use EDBO for your own research! Whether you are interested in optimizing the yield or selectivity of a reaction, finding a new ligand structure, or choosing the best model architecture for modeling chemical reaction data you can use Bayesian optimization to help find an optimal configuration for your task.

Bayesian Reaction Optimization Using EDBO - Part IV

1 minute read

Published: October 06, 2020

Part IV - Bayesian Reaction Optimization Workshop

Bayesian Reaction Optimization Using EDBO - Part II

10 minute read

Published: October 01, 2020

Part II - Software introduction

In part I we installed the pre-release of EDBO and ran some basic functionality tests. Now in part II we can dive into a basic introduction to using the software. In this post we provide example code for Bayesian optimization of a 1D objective which can be used to explore some of the software’s features. The main Bayesian optimization program is accessed through the edbo.bro module. The main BO classes, edbo.bro.BO and edbo.bro.BO_express, enable users to select initial experiments with experimental designs, run BO on human-in-the-loop or computational objectives, model data, and analyze results. Note: BO parameters are preset to those optimized for DFT encodings in the paper. However, BO_express attempts to automate the selection of priors based on the search space. In general, the BO class is more flexible but as a result less user friendly. Therefore let’s use the BO_express class in this demonstration.

To start we need to define a search space and an objective. In general, for any application it is up to us to define where the optimizer will search for conditions that maximize our objective. For a reaction your objective may be the yield of desired product; here I am using an arbitrary function, so feel free to change it to anything you want for this demo.

Define Objective and Search Space

import numpy as np
import matplotlib.pyplot as plt

# Define a computational objective
# EDBO works with feature vectors so even a 1D objective needs to be vectorized

def f(x):
    """Noise free objective."""
    
    return np.sin(x[0]) * x[0] * 5 + 30

def g(x):
    """With noise."""
    
    return f(x) + (np.random.random() - 0.5) * 15
  
# BO uses a user defined domain

X = np.linspace(0,10,1000)    # Grid of 1000 points between 0 and 10

Now we can use matplotlib to visualize the objective.

sample = np.random.choice(X, 100)
plt.figure(figsize=(5,5))
plt.plot(X, [f([x]) for x in X])
plt.scatter(sample, [g([x]) for x in sample], alpha=0.5)
plt.xlabel('x')
plt.ylabel('f(x)')
plt.title('"Unknown" Objective')
plt.show()
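Since the objective is known in this demo, it is handy to note where its true maximum lies so that we can judge the optimizer later. The sketch below re-creates the same grid and noise-free objective without NumPy; `f_scalar` is a hypothetical scalar version of f introduced only for this check.

```python
import math

# Same grid as X = np.linspace(0, 10, 1000), written without NumPy
grid = [10 * i / 999 for i in range(1000)]


def f_scalar(x):
    """Noise-free objective for a plain float (the f above takes a vector)."""
    return math.sin(x) * x * 5 + 30


# Grid point with the highest noise-free objective value
x_best = max(grid, key=f_scalar)
print(f"x = {x_best:.2f}, f(x) = {f_scalar(x_best):.1f}")
```

The maximum sits near x = 8 with a value just under 70, so a successful optimization run should concentrate its later experiments in that region.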

Using EDBO

With our search space prepared we can now use EDBO to choose initial experiments, evaluate models, and run Bayesian optimization. There are several ways in which the main BO methods can be used. Let’s start by checking out the options when instantiating BO objects. Here is a link to the documentation page: edbo.bro.

First, as we are checking out some of EDBO’s features it will be handy to have a nice plotting function.

# Handy function to visualize the results

def map_corr(df):
    """Get corresponding points in unstandardized domain."""
    
    index = []
    for x in df.values:
        i = np.argwhere(bo.obj.domain.values == x).flatten()[0]
        index.append(i)
    
    return bo.reaction.get_experiments(index)

def plot_results(export_path=None, plot_samples=True):
    """Plot summary of 1D BO simulations."""

    mean = bo.obj.scaler.unstandardize(bo.model.predict(bo.obj.domain))                             # GP posterior mean
    std = np.sqrt(bo.model.variance(bo.obj.domain)) * bo.obj.scaler.std * 2                         # 2 x GP posterior standard deviation
    next_points = bo.reaction.get_experiments(bo.proposed_experiments.index.values).copy()          # Next points proposed by BO
    next_points['g(x)'] = [f(x) for x in next_points.values]
    results = map_corr(bo.obj.results.drop('g(x)', axis=1))                                         # Results for known data
    results['g(x)'] = [g(x) for x in results.values]    
    
    plt.figure(1, figsize=(8,8))

    # Model mean and standard deviation
    plt.subplot(211)
    plt.plot(X, [f([x]) for x in X], color='black')
    plt.plot(X, mean, label='GP')
    plt.fill_between(X, mean-std, mean+std, alpha=0.4)

    # Known results and next selected point
    plt.scatter(results['x_index'], results['g(x)'], color='black', label='known')
    plt.scatter(next_points['x_index'], next_points['g(x)'], color='red', label='next')
    plt.ylabel('f(x)')
    
    # Plot some posterior samples
    if plot_samples:
        samples = bo.obj.scaler.unstandardize(bo.model.sample_posterior(bo.obj.domain, batch_size=2))
        i = 1
        for sample in samples:
            plt.plot(X, sample.numpy(), '--', label='sample' + str(i))
            i += 1
    
    plt.legend(loc='lower left')

    # Plot the acquisition function
    plt.subplot(212)
    for p in bo.acq.function.projections:
        plt.plot(bo.obj.domain['x'], p)

    plt.xlabel('x')
    plt.ylabel('Acquisition Function')
    
    if export_path is not None:
        plt.savefig(export_path, format='svg', dpi=1200, bbox_inches='tight')
    
    plt.show()

Initialization methods

Suppose we have no data and want to start by selecting initial experiments to run. With EDBO we can do this at random or by using clustering methods. I have also written some DOE add-on modules which enable you to use response surface (e.g., central composite) and fractional factorial designs. However, these are not included in EDBO 0.0.0. Here we use the centroids from k-means clustering for initialization.

from edbo.bro import BO_express

# (1) Define a dictionary of components
components = {'x':X}

# (2) Define a dictionary of desired encodings
encoding={'x':'numeric'}

# (3) Instantiate BO object
bo = BO_express(components,
                encoding,
                batch_size=2,
                target='g(x)',
                init_method='kmeans')

# (4) Choose initial experiments using k-means
bo.init_sample()

print('\nNormalized domain points:')
bo.proposed_experiments

Normalized domain points:

	x
252	0.252252
751	0.751752

We can get the unnormalized experiments (or SMILES strings etc.) using the get_experiments method.

print('\nDomain points:')
bo.get_experiments()

Domain points:

	x_index
252	2.52252
751	7.51752

And we can plot the choices on the domain.

plt.figure(figsize=(6,1))
plt.scatter(bo.obj.domain['x'], np.ones((len(bo.obj.domain))))
plt.scatter(bo.proposed_experiments, np.ones((len(bo.proposed_experiments))), s=100)
plt.xlabel('x')
plt.yticks([])
plt.show()

Human-in-the-loop optimization

Now we can move on to the optimization. If you were really running experiments in the lab you would likely just want to use the run method to iteratively choose experiments, then go into the lab, run the experiments, collect the results, and read them back into the optimizer. Let’s see what that would look like. First let’s export the proposed experiments to a CSV file so we can add the results after we “run” the experiments.

# Without an argument this will export 'experiments.csv' to the cwd.
bo.export_proposed()

Since this is actually a computational objective we can “run” the experiments right here.

# "Run" the experiments
expts = bo.get_experiments()
expts['g(x)'] = [g(x) for x in expts.values]

# Save the results as a CSV
expts.to_csv('results.csv')

# Load the results
bo.add_results('results.csv')

Then in order to choose the next experiments we simply use the run method.

bo.run()

And we can return basic analysis of the acquisition process using the acquisition_summary method.

bo.acquisition_summary()

	x	predicted g(x)	variance
960	0.960961	60.7232	1581.51
631	0.631632	64.5289	969.982

You can continue this process iteratively until the objective is maximized or you run out of resources. We can get an idea of what is going on under the hood using our plotting function. In the top plot notice that the model mean fits the experimental results well and that the model confidence region (2$\sigma$) captures the unknown objective. As a result, when we sample the posterior predictive distribution of the model you can see that one of the random functions (yellow dashed) actually captures most of the variation in the objective. The default acquisition function used by EDBO is parallel expected improvement (EI). The computed EI, used to select the next round of experiments, is shown in the bottom plot. Notice that the ArgMax of the acquisition function gives the next two experiments (red points).
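For intuition about what EI rewards, here is the textbook analytic form for a single candidate under a Gaussian posterior. This is a sketch of the standard single-point formula, not EDBO's batch (parallel) implementation; the function name and the exploration offset `xi` are assumptions made for illustration.

```python
import math


def expected_improvement(mean, std, best, xi=0.0):
    """Analytic EI for maximization under a Gaussian posterior."""
    improvement = mean - best - xi
    if std <= 0.0:
        # No uncertainty: the improvement is known exactly
        return max(improvement, 0.0)
    z = improvement / std
    pdf = math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)   # standard normal density
    cdf = 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))          # standard normal CDF
    return improvement * cdf + std * pdf


# Two hypothetical candidates with the same predicted value but different uncertainty
print(expected_improvement(mean=60.0, std=5.0, best=65.0))
print(expected_improvement(mean=60.0, std=40.0, best=65.0))
```

Both candidates are predicted to be worse than the current best, yet the uncertain one has a much larger EI, which is why points with modest predicted values but large variance can still be selected.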

Automated optimization

Given that $f$ is actually a computational objective, we could just use EDBO to automatically optimize the objective. Below is some sample code for how you can do this using the computational objective option.

# EDBO works on the normalized search space
# We need a new function that maps to the real domain
def h(x):
    """Deal with scaling."""
    
    i = np.argwhere(bo.obj.domain.values == x).flatten()[0]
    df = bo.reaction.get_experiments(i)
    
    return g(df.values)

# Use the computational_objective argument
bo = BO_express(components,
                encoding,
                batch_size=2,
                target='g(x)',
                computational_objective=h)

# Run the optimization automatically using simulate
bo.simulate(seed=4, iterations=5)

# Plot the results
plot_results()
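The only subtle part of the code above is the wrapper h, which has to translate a point in the normalized space back to the real domain before evaluating the objective. Here is a standalone sketch of that lookup. I am assuming min-max scaling and a made-up real domain purely for illustration, and I match floats with a tolerance rather than with == since exact equality on floats can be fragile.

```python
import math


def min_max_normalize(values):
    """Scale values linearly onto [0, 1]."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]


def lookup_real_value(x_norm, normalized, real, tol=1e-9):
    """Find the row whose normalized value matches x_norm; return its real value."""
    for i, candidate in enumerate(normalized):
        if math.isclose(candidate, x_norm, abs_tol=tol):
            return real[i]
    raise ValueError(f"{x_norm} is not a point in the normalized domain")


# Hypothetical real domain: 21 evenly spaced values from 10.0 to 20.0.
real_domain = [10.0 + 0.5 * i for i in range(21)]
normalized_domain = min_max_normalize(real_domain)

print(lookup_real_value(normalized_domain[7], normalized_domain, real_domain))
```

Because the normalized and real domains share row order, the row index is all you need to go from one to the other.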

Configuring the optimizer

Models. In Bayesian optimization, the surrogate model type defines a prior over functions which captures our assumptions about the shape of the response surface. When we combine this prior with observed reaction data, we get a posterior distribution of functions which we can use to reason about the possible positions of global optima. Practically speaking, many acquisition functions (but not all, e.g., Thompson Sampling) are formulated from the surrogate model’s mean and variance. Thus, in principle any regression model can be employed in Bayesian optimization (e.g., by bootstrapping variance estimates). EDBO currently has three different surrogate models built into the edbo.models module: Gaussian processes (edbo.models.GP_Model, GPyTorch), random forests (edbo.models.RF_Model, Scikit-Learn), and Bayesian linear regression (edbo.models.Bayesian_Linear_Model, Scikit-Learn). See the edbo.models documentation page for more details. We can get an idea of the shape of these functions using the plotting method we wrote above (vide infra). It is also straightforward to implement your own model; see the edbo.models module for examples. Below is an example code block for utilizing a random forest model instead of the default Gaussian process.

from edbo.bro import BO_express
from edbo.models import RF_Model

# (1) Define a dictionary of components
components = {'x':X}

# (2) Define a dictionary of desired encodings
encoding={'x':'numeric'}

# (3) Instantiate BO object
bo = BO_express(components,
                encoding,
                model=RF_Model)
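To illustrate the remark about bootstrapping variance estimates, here is a minimal standalone sketch: refit a simple model on resampled data many times and use the spread of the predictions as the variance. I am using a straight-line fit and toy data purely for illustration; none of this is EDBO code, and the function names are my own.

```python
import random
import statistics


def fit_line(xs, ys):
    """Ordinary least squares for y = a + b * x; returns (a, b)."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    if sxx == 0.0:
        return my, 0.0  # degenerate resample: fall back to a constant
    b = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx
    return my - b * mx, b


def bootstrap_predict(xs, ys, x_new, n_models=200, seed=0):
    """Mean and variance of predictions from models fit on bootstrap resamples."""
    rng = random.Random(seed)
    n = len(xs)
    preds = []
    for _ in range(n_models):
        idx = [rng.randrange(n) for _ in range(n)]  # sample with replacement
        a, b = fit_line([xs[i] for i in idx], [ys[i] for i in idx])
        preds.append(a + b * x_new)
    return statistics.fmean(preds), statistics.pvariance(preds)


# Toy data, roughly y = 1 + 5x with some noise.
xs = [0.0, 0.2, 0.4, 0.6, 0.8, 1.0]
ys = [1.0, 2.1, 2.9, 4.2, 4.8, 6.1]

mean_near, var_near = bootstrap_predict(xs, ys, x_new=0.5)
mean_far, var_far = bootstrap_predict(xs, ys, x_new=3.0)
print(mean_near, var_near)
print(mean_far, var_far)
```

The variance grows as we move away from the data, which is the behavior an acquisition function needs from a surrogate model.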

Gaussian Process Regression (EDBO’s default model):

Random Forest Regression:

Bayesian Linear Regression:

Acquisition functions. The acquisition function is the algorithm responsible for selecting the next experiments to run based on the information captured by the surrogate model. Most acquisition functions are built to balance the exploration of the search space with the exploitation of information available from evaluated experiments. EDBO has several acquisition functions available via keyword arguments from the BO and BO_express classes. A full list can be found in the documentation. The default acquisition function, expected improvement, is derived from the expectation value of the improvement utility function. Below is an example code block for choosing different acquisition functions and a few examples of parallel acquisition functions which utilize the Kriging Believer algorithm for batching.

from edbo.bro import BO_express

# (1) Define a dictionary of components
components = {'x':X}

# (2) Define a dictionary of desired encodings
encoding={'x':'numeric'}

# (3) Instantiate BO object
bo = BO_express(components,
                encoding,
                acquisition_function='UCB')

Expected improvement (EDBO’s default acquisition function):

Probability of improvement:

Upper confidence bound:

Mean maximization (pure exploitation):

Variance maximization (pure exploration):
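To see how these strategies differ, here is a small standalone sketch that scores three hypothetical candidates with each rule using the textbook single-point formulas. The predictions, the best observed value, and the UCB weight beta=2 are all assumptions for illustration and are not EDBO’s defaults.

```python
import math


def probability_of_improvement(mean, variance, best):
    """P(f(x) > best) under a Gaussian predictive distribution."""
    sigma = math.sqrt(variance)
    if sigma == 0.0:
        return 1.0 if mean > best else 0.0
    z = (mean - best) / sigma
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def upper_confidence_bound(mean, variance, beta=2.0):
    """Optimistic estimate: mean plus beta standard deviations."""
    return mean + beta * math.sqrt(variance)


# Hypothetical surrogate predictions: candidate -> (mean, variance).
predictions = {"A": (80.0, 1.0), "B": (70.0, 400.0), "C": (78.0, 25.0)}
best = 79.0  # assumed best observed value so far


def argmax(score):
    """Candidate with the highest score(mean, variance)."""
    return max(predictions, key=lambda name: score(*predictions[name]))


choices = {
    "mean maximization": argmax(lambda m, v: m),
    "variance maximization": argmax(lambda m, v: v),
    "upper confidence bound": argmax(upper_confidence_bound),
    "probability of improvement": argmax(
        lambda m, v: probability_of_improvement(m, v, best)
    ),
}
for name, pick in choices.items():
    print(f"{name}: {pick}")
```

With these numbers, the exploitative rules pick the safe candidate A, while the rules that reward uncertainty pick the poorly characterized candidate B.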

Analysis

During optimization we can run miscellaneous analyses using some of EDBO’s built-in functions. For example, we can plot the optimizer’s path.

bo.plot_convergence()
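A convergence trace is easy to compute by hand if you want to check the numbers. This sketch assumes the trace is the best observed value after each round of experiments; that is my reading of a convergence plot rather than something I am quoting from EDBO, and the observed values are made up.

```python
from itertools import accumulate


def best_so_far(batches):
    """Running maximum of the objective after each batch of experiments."""
    batch_best = [max(batch) for batch in batches]
    return list(accumulate(batch_best, max))


# Hypothetical results from four rounds with a batch size of 2.
observed = [[12.0, 30.5], [28.0, 41.2], [39.9, 40.0], [55.3, 47.1]]
print(best_so_far(observed))
```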

And we can evaluate how well the model fits the experimental data.

bo.model.regression()
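If you want to quantify the fit yourself, two common summary statistics are the root-mean-square error and the coefficient of determination. The sketch below computes both from observed and predicted values; whether EDBO reports exactly these metrics is an assumption on my part, and the numbers are toy data.

```python
import math


def rmse(observed, predicted):
    """Root-mean-square error between observations and predictions."""
    squared = [(o - p) ** 2 for o, p in zip(observed, predicted)]
    return math.sqrt(sum(squared) / len(squared))


def r_squared(observed, predicted):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_obs = sum(observed) / len(observed)
    ss_res = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    ss_tot = sum((o - mean_obs) ** 2 for o in observed)
    return 1.0 - ss_res / ss_tot


# Toy data: three observations and the corresponding model predictions.
obs = [1.0, 2.0, 3.0]
pred = [1.0, 2.0, 4.0]
print(rmse(obs, pred), r_squared(obs, pred))
```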

Finally, note that if you need help, EDBO has a basic bot which can run most of its methods. You can call the bot using the help method, for example if you want to save your workspace for later.

bo.help()

This will spawn an interactive session:

edbo bot: What can I help you with?
~  Save my workspace

edbo bot: Can you clarify: pickle BO object for later, export proposed, or exit?
~  pickle it

edbo bot: Save instace? (yes or no) You can load instance later with edbo.BO_express.load().
~  yes

edbo bot: Saving edbo.BO instance...


edbo bot: What can I help you with?
~  exit

edbo bot: Exiting...

Up next

EDBO and the main BO classes have a lot more features but hopefully this gives you an idea of how it could be used. In the next post we will see how to apply EDBO to chemical reaction data.

Bayesian Reaction Optimization Using EDBO - Part I

2 minute read

Published: September 30, 2020

Recently, in collaboration with folks over at Princeton and Bristol Myers Squibb, I finished writing a Python package called Experimental Design via Bayesian Optimization (EDBO) for reaction optimization, which enables the application of Bayesian optimization, an uncertainty-guided response surface method, to chemical reactions in the laboratory. The paper is submitted for publication and under review, so I have not yet made the repository public. However, to facilitate training and beta testing I am writing a few preliminary posts on (1) installation and basic software usage, (2) simulations with real chemical reaction data, (3) using EDBO in the lab, and (4) tackling computational optimization problems.

Reference: Shields, Benjamin J.; Stevens, Jason; Li, Jun; Parasram, Marvin; Damani, Farhan; Martinez Alvarado, Jesus; Janey, Jacob; Adams, Ryan P.; Doyle, Abigail G. “Bayesian Reaction Optimization as a Tool for Chemical Synthesis” Manuscript Submitted.

Part I - Installation

Ok boring stuff first. In this post we will be tackling software installation from the code in my private repository (so no Git, PyPI, or Anaconda for now).

Install conda

If you haven’t already installed anaconda (or miniconda) on your machine you can follow the instructions provided by conda.

Install EDBO

Windows Script

I wrote a shell script (install.sh) to install EDBO on Windows machines. You will find a copy in the edbo.zip folder provided.

Download and unzip the folder.
Open an anaconda prompt, navigate to the edbo directory, and run the script.

cd path/to/edbo/directory
sh install.sh

Mac/Linux Script

I wrote a slightly different shell script (install_mac.sh) to install EDBO on Mac/Linux machines. You will find a copy in the edbo.zip folder provided.

Download and unzip the folder.
Open a terminal and create a conda environment for EDBO.

conda create -y --name edbo python=3.7.5
conda activate edbo

Navigate to the edbo directory and run the script.

cd path/to/edbo/directory
sh install_mac.sh

Software tests

Use the pytest framework to run some basic software tests to make sure the installation worked. In the anaconda prompt (or terminal for Mac/Linux), navigate to the folder containing edbo. Then run the following commands and you will see test logs appear in the testing directory. These may take a few minutes to run, and you should see some warnings but no failed tests. If you do see failed tests, please let me know so I can fix the issue and update the software.

conda activate edbo
cd tests
sh basic_tests.sh
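Before running the test suite it can save time to confirm that the environment can actually find the packages. This is a small sketch using only the standard library; the list of import names (edbo plus the GPyTorch and scikit-learn stack) is my assumption about what the install script sets up, so adjust it as needed.

```python
import importlib.util
import sys


def missing_packages(names):
    """Return the import names that cannot be found in this environment."""
    return [name for name in names if importlib.util.find_spec(name) is None]


# Assumed import names; edit to match your installation.
required = ["edbo", "torch", "gpytorch", "sklearn"]

print("Python", sys.version.split()[0])
print("Missing:", missing_packages(required) or "none")
```

If anything is listed as missing, make sure the edbo conda environment is activated before rerunning the install script.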

Up next

That wraps up this post. In Part II we will walk through a basic introduction to the software.

Automatic Design of SARS-CoV-2 M^pro Inhibitors via Machine Learning & Molecular Docking

22 minute read

Published: August 09, 2020

Benjamin J. Shields, Ph.D.
